17.5 Intrinsic Methods

261

M[0, 0] = z

for each i in 1 .. S1.length

M[i,0] = f( M[i-1, 0 ], c(S1[i], "_" ) )

-- Boundary

for each j in 1 .. S2.length

M[0,j] = f( M[0, j-1], c("_", S2[j] ) )

-- conditions

for each i in 1 .. S1.length and j in 1 .. S2.length

M[i,j] = g(f(M[i-1, j-1], c(S1[i], S2[j])),

-- (mis)match

f(M[i-1, j ], c(S1[i], "_" )),

-- delete S1[i]

f(M[i,

j-1], c("_",

S2[j])))

-- insert S2[j]

Applied to sequence alignment, two varieties of DPA are in use: the Needleman–

Wunsch (“global alignment”) algorithm, which builds up an alignment starting with

easily achievable alignments of small subsequences, and the Smith–Waterman (“local

alignment”) algorithm that is similar in concept, except that it does not systematically

move through the sequences from one end to the other, but compares subsequences

anywhere.

It is often tacitly assumed that the sequences are random (i.e., incompressible),

but if they are not (i.e., they are compressible to some degree), this should be taken

into account.

There are also some heuristic algorithms (e.g., BLAST and FASTA) that are

faster than the DPAs. They look for matches of short subsequences, which may be

only a few nucleotides or amino acids long, that they then seek to extend. As with

the DPAs, some kind of scoring system has to be used to quantify matches.

Although sequence alignment has become very popular, some of the assump-

tions are quite weak and there is strong motivation to seek alternative methods for

evaluating the degree of kinship between sequences, not based on symbol-by-symbol

comparison; for example, one could evaluate the mutual information between strings

aa and bb (cf. Sects. 7.4.1 and 11.5):

upper I left parenthesis s Subscript a Baseline comma s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline comma s Subscript a Baseline right parenthesis equals upper I left parenthesis s Subscript a Baseline right parenthesis minus upper I left parenthesis s Subscript a Baseline vertical bar s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline right parenthesis minus upper I left parenthesis s Subscript b Baseline vertical bar s Subscript a Baseline right parenthesis periodI (sa, sb) = I (sb, sa) = I (sa)I (sa|sb) = I (sb)I (sb|sa).

(17.2)

Multiple alignment is an obvious extension of pairwise alignment.

17.5

Intrinsic Methods

The template or intrinsic approach involves constructing concise descriptions of

prototype objects and then identifying genes by searching for matches to such pro-

totypes. An elementary example is searching for motifs (i.e., short subsequences)

known to interact with particular drugs. The motif is often defined more formally

along the lines of a sequence of amino acids that itself defines a substructure in a

protein that can be connected in some way to protein function or structural stability